Managed Availability <-> SCOM issues: Get-ServerHealth reports Unknown

Figures:

Exchange Server 2013 SP1 CU7: 8 multi-role servers (6 CPUs, 24 GB memory), 92 mailbox databases in one DAG with 3 copies, evenly distributed across all 8 servers. 1 MBX server for 3 recovery databases.

SCOM 2012 R2 RU4 with the latest Exchange MP (December 2014), currently being implemented.

Facts:

All (but one) Exchange servers are continuously flipping between Monitored and Not Monitored, every 5 minutes to 1 hour (I've increased Object Discovery for troubleshooting on all Exchange servers).

All servers (except the same one as above) return Unknown on Get-HealthReport or Get-ServerHealth when the SCOM status is Not Monitored, and return the expected output when SCOM reports Monitored.

On all (but one) servers MSExchangeHMWorker runs high on CPU (~50%) and SCOM reports 'Service terminated unexpectedly' for this service about 45 times in 48 hours (on 2 of the servers, 200 times in 48 hours).
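To see at a glance which servers are returning Unknown at a given moment, the Get-ServerHealth output can be tallied per server by its AlertValue column. A minimal sketch, assuming it is run from the Exchange Management Shell; filtering on IsMailboxServer is my own assumption about which servers to include:

```powershell
# Tally Get-ServerHealth results per server by AlertValue (Healthy, Unhealthy, Unknown, ...).
# Assumes the Exchange Management Shell; the IsMailboxServer filter is an assumption.
Get-ExchangeServer | Where-Object { $_.IsMailboxServer } | ForEach-Object {
    $server = $_.Name
    Get-ServerHealth -Identity $server |
        Group-Object -Property AlertValue |
        Select-Object @{ Name = 'Server'; Expression = { $server } }, Name, Count
}
```

Running this a few times some minutes apart should show whether the Unknown results line up with the moments SCOM shows Not Monitored.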

https://social.technet.microsoft.com/Forums/exchange/en-US/d2200edc-8512-45af-85e4-a2b813b120b9/msexchange-common-event-6003-msexchangehmhost?forum=exchangesvradmin

I've added a D: drive and the path is created with logfiles. Unfortunately the problem remains.

In the Managed Availability event logs I see typical behaviour and events (Healthset X determined to be healthy, and so on), as if MA is working like a charm, but querying MA via Get-ServerHealth doesn't work.

I know for sure SCOM (and MA) were working fine; I've seen all 9 servers Monitored (and unhealthy Healthsets from time to time) in mid-to-end 2014 when we ran a pilot for SCOM. If I remember correctly CU4 and CU6 were implemented in the meantime, as far as I know without issues.

We have 2 Management Groups (the pilot group and the production group), both behave the same.

In the event logs I can find no (obvious?) events regarding MA not working, so we have no idea where to start troubleshooting.

All servers (VMware) were installed and configured the same way.


  • Edited by JeroenvH1 Tuesday, February 10, 2015 7:03 PM
February 9th, 2015 11:02pm

Hi,

Based on your description, the Managed Availability log looks fine, but you get the Unknown state when you run the Get-HealthReport or Get-ServerHealth commands. Only one server works well.

Have you checked the application log and system log on the affected servers?

If you haven't, please check the application log and system log and see if there are any events related to Managed Availability.

Please make sure the Health Manager service is running. If necessary, please restart this service and check the result.

Please run the Get-Mailbox -Monitoring command. If you have an issue when you run this command, please recreate the monitoring mailbox and check the result.

Also, please compare the number of files in the following path on an affected server with those on the healthy server, and check whether any file is missing.

c:\program files\microsoft\exchange server\v15\bin\monitoring\config
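The checks above could be sketched roughly as follows. This is only an illustration: the two server names are hypothetical placeholders, the install path is assumed to be the default one quoted above and reachable through the C$ administrative share, and the Exchange Management Shell is assumed for Get-Mailbox.

```powershell
# Hypothetical placeholder names: substitute a healthy and an affected server.
$goodServer = 'EXCH-GOOD'
$badServer  = 'EXCH-BAD'
$relPath    = 'c$\program files\microsoft\exchange server\v15\bin\monitoring\config'

# 1. Is the Health Manager service (MSExchangeHM) running on the affected server?
Get-Service -ComputerName $badServer -Name MSExchangeHM

# 2. How many monitoring mailboxes exist in the organization?
(Get-Mailbox -Monitoring -ResultSize Unlimited | Measure-Object).Count

# 3. Which config files exist on only one of the two servers?
$good = Get-ChildItem -Path "\\$goodServer\$relPath" -File | Select-Object -ExpandProperty Name
$bad  = Get-ChildItem -Path "\\$badServer\$relPath"  -File | Select-Object -ExpandProperty Name
Compare-Object -ReferenceObject $good -DifferenceObject $bad
```

Compare-Object prints nothing when both folders hold the same file names, and raises an error if either folder turns out to be empty or unreachable.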

Hope this is helpful to you.

Best regards,

February 11th, 2015 9:40am

On all (but one) servers results from get-Healthreport and get-serverhealth vary. Either Unknown or the results as expected.

I couldn't find any suspicious events in the application and event logs. Health Manager is running (and consuming ~50% CPU).

Get-Mailbox -Monitoring gives interesting results:

About 1.100 Healthmailboxes...


  • Edited by JeroenvH1 Wednesday, February 11, 2015 11:48 AM
February 11th, 2015 2:46pm

Hi,

Do you mean you have 100 or 1100 health mailboxes?

Basically, there are two monitoring mailboxes per mailbox database. In your case, with 98 mailbox databases, you should have 196 health mailboxes.

If possible, please recreate these health mailboxes and check the result.
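Before recreating anything, it may help to see how the existing health mailboxes are spread across databases. A small read-only sketch, assuming the Exchange Management Shell; it only counts and groups, it does not delete or create mailboxes:

```powershell
# Read-only: count monitoring (health) mailboxes and group them by database.
$healthMailboxes = Get-Mailbox -Monitoring -ResultSize Unlimited
"Total monitoring mailboxes: $(@($healthMailboxes).Count)"

# Databases with the most health mailboxes are listed first.
$healthMailboxes |
    Group-Object -Property Database |
    Sort-Object -Property Count -Descending |
    Select-Object -Property Count, Name
```

Databases showing far more entries than the rest would be the first candidates to look at.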

Best regards,

February 12th, 2015 4:59am

1103 Healthmailboxes indeed...

We're deleting mailboxes per server now. We'll wait and see :)

February 12th, 2015 11:04am

We've deleted the Healthmailboxes:

Now there are ~45 Healthmailboxes per server (12 active mailbox databases each).

HMWorker is still 'terminated unexpectedly' about every hour.

About 10 minutes ago 7 out of 9 servers returned 'Unknown' for Get-HealthReport; just minutes ago only 3 did.

The Application log reports the following the minute before:

Performance counter updating error. Counter name is Time in Resource per second, category name is MSExchange Activity Context Resources. Optional code: 2. Exception: The exception thrown is : System.InvalidOperationException: Instance 'ad-powershell-defaultdomain' already exists with a lifetime of Process.  It cannot be recreated or reused until it has been removed or until the process using it has exited.

Edit: Reloading the performance counters (https://support.microsoft.com/kb/2870416?wa=wsignin1.0) did not solve the Application log errors.



February 17th, 2015 2:54pm

After further analysis I found that on all but one server the 'MonitorDefinition' channel is empty. The server that holds events in the MonitorDefinition channel is the one that reports the expected output from Get-ServerHealth.

How do we fix this?

10 minutes later: a server that reports as expected has events in the MonitorDefinition channel...
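Since the channel contents seem to change over time, a timestamped record count per server makes it easier to compare against the Get-ServerHealth results. A rough sketch, assuming the channel's full log name is Microsoft-Exchange-ActiveMonitoring/MonitorDefinition and that remote event log access to each server is allowed:

```powershell
# Record count of the MonitorDefinition channel on every Exchange server, with a timestamp.
# The full channel name is an assumption; verify it with: Get-WinEvent -ListLog *ActiveMonitoring*
$channel = 'Microsoft-Exchange-ActiveMonitoring/MonitorDefinition'

Get-ExchangeServer | ForEach-Object {
    $server = $_.Name
    Get-WinEvent -ListLog $channel -ComputerName $server -ErrorAction SilentlyContinue |
        Select-Object @{ Name = 'Time';   Expression = { Get-Date } },
                      @{ Name = 'Server'; Expression = { $server } },
                      RecordCount
}
```

A server that is missing from the output could not be queried (or does not have the channel); a RecordCount of 0 or empty corresponds to the empty channel described above.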


February 18th, 2015 4:29am

This topic is archived. No further replies will be accepted.
